Introduction
This file contains practice and exam questions suited to test your skills and understanding. It also illustrates the procedure of our mid-term exam (on June 4, 2018) and final exam (on July 16, 2018).
This exam contains a total of XXX tasks and a maximum score of YYY points.
Course coordinates
- Course Data Science for Psychologists (ds4psy).
- Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
- Spring/summer 2018: Mondays, 13:30–15:00, C511 (from 2018.04.16 to 2018.07.16)
- Links to ZeUS and Ilias
Preparation and response format
0. Please answer the following questions by creating a single R script (or an R-Markdown file .Rmd) that contains all your code and answers and meets the following criteria:
Layout issues:
Include a header that contains your name, student ID, this course, and today’s date.
Load the R packages of the tidyverse.
Structure your file clearly by labeling the current task (e.g., `# Task 1: -----`) and subtask (e.g., `# (a) ...:`) and by leaving blank lines between all tasks and subtasks.
Here’s a layout template that you can copy and adapt:
## Final exam | Data science for psychologists (Summer 2018)
## Name: ... | Student ID: ...
## 2018 07 16
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## Preparations: -----
library(tidyverse)
## Task 1: -----
# (a) Save data as tibble and inspect data:
pg <- as_tibble(PlantGrowth)
pg
## Answer: The PlantGrowth data contains 30 cases (rows) and 2 variables (columns).
## (b): ...
## ...
## Task X: -----
## (a) ...
## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
## End of file. -----
Save your script (regularly) as `Lastname_Firstname_midTerm_180604.R` (replacing `Lastname` and `Firstname` by your names). When asked for numbers or interpretations, include short answers as comments in your script. However, when asked for quantitative summaries containing more than 2 numbers (e.g., descriptive statistics of a dataset), simply print your results (e.g., the output of a `dplyr` pipe) in your code.
Submit your script as an attachment to an email (with the subject header "ds4psy: final exam") to h.neth@uni.kn by Wednesday (July 18th, 2018).
3: Data visualisation
Key skills
The chapter introduces visualizations (with ggplot), but not yet data transformations (with dplyr). Key skills to be acquired in this context include creating:
- scatterplots (`geom_point`) and trendlines (`geom_line` and `geom_smooth`),
- grouping (via aesthetic mappings like `color`, `shape`, `size`, etc.),
- faceting (`facet_wrap` and `facet_grid`),
- bar charts (`geom_bar`) for given values (`stat = "identity"`), frequency counts, or proportions, for different bar positions (`stack`, `fill`, and `dodge`),
- boxplots (`geom_boxplot`), and
- adjusting visual aspects (colors, shapes, themes), labels (e.g., plot titles and captions), and coordinates (`coord_cartesian`, `coord_flip`, and `coord_polar`).
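As a minimal, self-contained illustration of several of these skills (a sketch using the built-in `mpg` data, which is not part of any exam task), consider:

```r
library(tidyverse)

# Scatterplot of engine displacement vs. highway mileage:
# points grouped by color, one trendline, facets by drive train.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  facet_wrap(~drv) +
  labs(title = "Highway mileage by engine displacement",
       x = "Displacement (litres)", y = "Highway miles per gallon",
       caption = "Data from ggplot2::mpg.") +
  theme_bw()
```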
Example tasks
Task 1: Growing plants
The PlantGrowth data (contained in R datasets) reports the results from an experiment that compares growth yields (measured by the dried weight of plants) obtained under 2 treatments vs. a control condition.
1a. Save the PlantGrowth data as a tibble and inspect its dimensions. (1 point)
1b. Use a dplyr pipe to compute the number of observations (rows) in each group and some key descriptives (their mean, median, and standard deviation). (2 points)
1c. Use ggplot to create a graph that shows the medians and raw values of plant weight by group. (2 points)
Hints: Use 2 different geoms to show both medians and raw values in the same plot. The order of layers is determined by the order of geom commands.
# ?datasets::PlantGrowth
# (a) Save as tibble and inspect data:
pg <- as_tibble(PlantGrowth)
pg # => 30 cases (rows) x 2 variables (columns)
#> # A tibble: 30 x 2
#> weight group
#> <dbl> <fct>
#> 1 4.17 ctrl
#> 2 5.58 ctrl
#> 3 5.18 ctrl
#> 4 6.11 ctrl
#> 5 4.50 ctrl
#> 6 4.61 ctrl
#> 7 5.17 ctrl
#> 8 4.53 ctrl
#> 9 5.33 ctrl
#> 10 5.14 ctrl
#> # ... with 20 more rows
# (b) Compute number of observations by group, their mean, median and standard deviation:
pg %>%
group_by(group) %>%
summarise(count = n(),
mn_weight = mean(weight),
md_weight = median(weight),
sd_weight = sd(weight)
)
#> # A tibble: 3 x 5
#> group count mn_weight md_weight sd_weight
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 ctrl 10 5.03 5.15 0.583
#> 2 trt1 10 4.66 4.55 0.794
#> 3 trt2 10 5.53 5.44 0.443
# (c) Plot the median and raw values of weight by group:
ggplot(pg, aes(x = group, y = weight)) +
# geom_violin() +
geom_boxplot(aes(fill = group)) +
geom_point(aes(shape = group), alpha = 2/3, size = 4, position = "jitter") +
## Pimping plot:
labs(title = "Plant weight by group", x = "Group", y = "Weight",
caption = "Data from datasets::PlantGrowth.") +
scale_fill_manual(values = c("grey75", "gold", "steelblue3")) +
theme_bw()
pt <- pt + 5 # increment point total
Task 2: Time to sleep
The sleep data (contained in R datasets) shows the effect of 2 sleep-promoting drugs (as an increase in hours of sleep compared to a control group) for 10 patients.
2a. Save the sleep data as a tibble sp and inspect its dimensions. (1 point)
2b. Use a dplyr pipe to compute the number of observations (rows) by group
and key descriptives (their mean, median, and standard deviation). (2 points)
2c. Use ggplot to create a graph that shows the medians and raw values of extra sleep time by group. (2 points)
Hints: Use 2 different geoms to show both medians and raw values in the same plot. The order of layers is determined by the order of geom commands.
2d. Reformat the sleep data in sp so that the 2 groups appear in 2 lines (rows) and 10 subject IDs as 10 columns. (1 point)
Hints: This implies using the tidyr::spread command to change a long dataset into a wider format.
# ?datasets::sleep
# (a) Save data as tibble and inspect:
sp <- as_tibble(sleep)
sp # => 20 cases (rows) x 3 variables (columns)
#> # A tibble: 20 x 3
#> extra group ID
#> <dbl> <fct> <fct>
#> 1 0.700 1 1
#> 2 -1.60 1 2
#> 3 -0.200 1 3
#> 4 -1.20 1 4
#> 5 -0.100 1 5
#> 6 3.40 1 6
#> 7 3.70 1 7
#> 8 0.800 1 8
#> 9 0 1 9
#> 10 2.00 1 10
#> 11 1.90 2 1
#> 12 0.800 2 2
#> 13 1.10 2 3
#> 14 0.100 2 4
#> 15 -0.100 2 5
#> 16 4.40 2 6
#> 17 5.50 2 7
#> 18 1.60 2 8
#> 19 4.60 2 9
#> 20 3.40 2 10
# (b) Compute the number of observations and descriptives of extra time by group:
sp %>%
group_by(group) %>%
summarize(n = n(),
md = median(extra),
mn = mean(extra),
sd = sd(extra))
#> # A tibble: 2 x 5
#> group n md mn sd
#> <fct> <int> <dbl> <dbl> <dbl>
#> 1 1 10 0.350 0.750 1.79
#> 2 2 10 1.75 2.33 2.00
# (c) Visualize the raw values and averages by group:
ggplot(sp, aes(x = group, y = extra)) +
geom_boxplot(aes(fill = group)) +
# geom_violin() +
geom_point(aes(shape = group), size = 4, alpha = 2/3, position = "jitter") +
## Pimping plot:
labs(title = "Extra sleep time by treatment group",
x = "Treatment group", y = "Extra sleep time",
caption = "Data from datasets::sleep.") +
scale_fill_manual(values = c("gold1", "steelblue3")) +
theme_bw()
# (d) Reformat data so that the 2 groups appear in 2 lines,
# and 10 subject IDs as 10 columns:
sp %>%
spread(key = ID, value = extra)
#> # A tibble: 2 x 11
#> group `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
#> * <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.700 -1.60 -0.200 -1.20 -0.100 3.40 3.70 0.800 0 2.00
#> 2 2 1.90 0.800 1.10 0.100 -0.100 4.40 5.50 1.60 4.60 3.40
pt <- pt + 6 # increment point total
Task 3: Dietary chicks
The ChickWeight data (contained in R datasets) contains the results of an experiment that measures the effects of Diet on the early growth of chicks.
3a. Save the ChickWeight data as a tibble and inspect its dimensions. (1 point)
3b. Create a line plot showing the weight development of each individual chick (on the y-axis) over Time (on the x-axis) for each Diet (in 4 different facets). (2 points)
# ?datasets::ChickWeight
# (a) Save data as tibble and inspect:
cw <- as_tibble(ChickWeight)
cw # => 578 observations (rows) x 4 variables (columns)
#> # A tibble: 578 x 4
#> weight Time Chick Diet
#> * <dbl> <dbl> <ord> <fct>
#> 1 42.0 0 1 1
#> 2 51.0 2.00 1 1
#> 3 59.0 4.00 1 1
#> 4 64.0 6.00 1 1
#> 5 76.0 8.00 1 1
#> 6 93.0 10.0 1 1
#> 7 106 12.0 1 1
#> 8 125 14.0 1 1
#> 9 149 16.0 1 1
#> 10 171 18.0 1 1
#> # ... with 568 more rows
# (b) Scatter and/or line plot showing the weight development of each chick (on the y-axis)
# over Time (on the x-axis) for each Diet (as different facets):
ggplot(cw, aes(x = Time, y = weight, group = Diet)) +
facet_wrap(~Diet) +
geom_point(alpha = 1/2) +
geom_line(aes(group = Chick)) +
geom_smooth(aes(color = Diet)) +
labs(title = "Chick weight by time for different diets", x = "Time (number of days)", y = "Weight (in gm)",
caption = "Data from datasets::ChickWeight.") +
theme_bw()
3c. The following bar chart shows the number of chicks per Diet over Time.
We see that the initial Diet groups contain different numbers of chicks and that some chicks drop out over Time:
# (c) Bar plot showing the number (count) of chicks per diet over time:
ggplot(cw, aes(x = Time, fill = Diet)) +
geom_bar(position = "dodge") +
labs(title = "Number of chicks per diet over time", x = "Time (number of days)", y = "Number",
caption = "Data from datasets::ChickWeight.") +
theme_bw()
Instead of re-creating this plot, create a table (or tibble) that shows the same (4 x 12 = 48) data points in 4 rows (for the 4 different types of Diet) and 12 columns (for the different Time points). (2 points)
Hints: In a first step, count the number of chicks per Diet and Time. Next, spread the results into a wider format (to show the time points as different columns).
# (c) Re-create the counts of chicks per diet over time numerically (using `dplyr` and `tidyr`).
# as a table: How many chicks are there per diet over time?
cw %>%
group_by(Diet, Time) %>%
count() %>%
spread(key = Time, value = n)
#> # A tibble: 4 x 13
#> # Groups: Diet [4]
#> Diet `0` `2` `4` `6` `8` `10` `12` `14` `16` `18` `20`
#> * <fctr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 20 20 19 19 19 19 19 18 17 17 17
#> 2 2 10 10 10 10 10 10 10 10 10 10 10
#> 3 3 10 10 10 10 10 10 10 10 10 10 10
#> 4 4 10 10 10 10 10 10 10 10 10 10 9
#> # ... with 1 more variables: `21` <int>
## Not asked:
# (x) Plot the weight of each individual chick
# and the Median weight per diet at the end (Time = 21).
# **Hint:** Filter data for the maximum time and
# combine 2 geoms: A boxplot and a scatterplot.
cw %>%
filter(Time == 21) %>%
ggplot(., aes(x = Diet, y = weight, fill = Diet)) +
geom_boxplot() +
geom_point(size = 4, alpha = 1/2) +
coord_flip() +
theme_bw()
pt <- pt + 5 # increment point total
Task 4: Tabular TBC
Using tidyr::table1 data.
The tidyr::table1 data shows the number of TB cases for 3 countries and 2 years.
4a. Plot a bar chart that shows the number of cases per country (on the y-axis) as a function of the year (on the x-axis). (2 points)
4b. Format the bars (showing cases per country) in different colors. (1 point)
4c. Provide a suitable plot title and a caption noting the data source. (1 point)
4d. Label each bar with the number of cases. (1 point)
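This file provides no model solution for Task 4; one possible sketch (using `geom_bar(stat = "identity")` for given values and `geom_text` for the bar labels asked for in 4d) could be:

```r
library(tidyverse)

# (a-d) Bar chart of TB cases (y) by year (x), with bars colored
# by country, value labels on each bar, and a title and caption:
ggplot(table1, aes(x = factor(year), y = cases, fill = country)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = cases),
            position = position_dodge(width = 0.9), vjust = -0.3) +
  labs(title = "Number of TB cases by country and year",
       x = "Year of count", y = "Number of cases",
       caption = "Data from tidyr::table1.") +
  theme_bw()
```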
Task 5: Party plots
The following table provides the percentage shares of 2 major parties in the last 2 general elections of Germany (based on data from www.bundeswahlleiter.de):
| Party: | Share 2013: | Share 2017: |
|---|---|---|
| CDU/CSU | 41.5% | 33% |
| SPD | 25.7% | 20.5% |
| Others | ? | ? |
5a. Create a tibble de that contains this data and the missing (?) values for all other parties so that all shares of an election add up to 100%. (2 points)
5b. Convert your de table into a “tidy” table saved as de_2. (2 points)
Hints: Use tidyr::gather to list the values of all election results in 1 variable called share and make sure that de_2 contains a separate variable (column) that specifies the election year.
5c. Visualize and contrast the election results by a bar chart that contains 2 bars (representing the 2 elections) and the party’s share of votes (as the proportions of each bar). (2 points)
Hints: As the data in de_2 already contains the identity of the values which you want to plot, there is no need to count anything. Showing multiple values in one bar is called a “stack”.
## (a) Create a tibble with the data:
de <- tibble(
party = c("CDU/CSU", "SPD", "Others"),
share_2013 = c((.341 + .074), .257, (1 - (.341 + .074) - .257)),
share_2017 = c((.268 + .062), .205, (1 - (.268 + .062) - .205))
)
de$party <- factor(de$party, levels = c("CDU/CSU", "SPD", "Others")) # optional
de
#> # A tibble: 3 x 3
#> party share_2013 share_2017
#> <fct> <dbl> <dbl>
#> 1 CDU/CSU 0.415 0.330
#> 2 SPD 0.257 0.205
#> 3 Others 0.328 0.465
## Check that columns add to 100:
sum(de$share_2013) # => 1 (qed)
#> [1] 1
sum(de$share_2017) # => 1 (qed)
#> [1] 1
## (b) Converting de into a tidy data table:
de_2 <- de %>%
gather(share_2013:share_2017, key = "election", value = "share") %>%
separate(col = "election", into = c("dummy", "year")) %>%
select(year, party, share)
de_2
#> # A tibble: 6 x 3
#> year party share
#> * <chr> <fct> <dbl>
#> 1 2013 CDU/CSU 0.415
#> 2 2013 SPD 0.257
#> 3 2013 Others 0.328
#> 4 2017 CDU/CSU 0.330
#> 5 2017 SPD 0.205
#> 6 2017 Others 0.465
## Note that year is of type character, which could be changed by:
# de_2$year <- parse_integer(de_2$year)
## (c) Bar chart showing proportions for each election:
ggplot(de_2, aes(x = year, y = share, fill = party)) +
## (A) 1 bar per election (position = "stack"):
geom_bar(stat = "identity", position = "stack") + # 1 bar per election
## (B) 3 bars per election (position = "dodge"):
# geom_bar(stat = "identity", position = "dodge") + # 3 bars next to each other
## Pimping plot:
scale_fill_manual(values = c("black", "red3", "gold")) + # optional
labs(title = "Partial results of the German general elections 2013 and 2017",
x = "Year of election", y = "Share of votes",
caption = "Data from www.bundeswahlleiter.de.") +
theme_classic()
pt <- pt + 6 # increment point total
Tasks
- Create a data frame or tibble
- Plot bar chart (with `stat = "identity"`)
- Create pie chart (with `coord_polar`)
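Of these tasks, only the pie chart (via `coord_polar`) is not demonstrated elsewhere in this file; a minimal sketch (re-using the 2013 party shares from Task 5) could be:

```r
library(tidyverse)

# Party shares of the 2013 election (values as in Task 5):
de_13 <- tibble(
  party = c("CDU/CSU", "SPD", "Others"),
  share = c(.415, .257, .328)
)

# A pie chart is a single stacked bar drawn in polar coordinates:
ggplot(de_13, aes(x = "", y = share, fill = party)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Shares of the 2013 general election",
       caption = "Data from www.bundeswahlleiter.de.") +
  theme_void()
```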
5: Data transformation
library(tidyverse) # dplyr
library(nycflights13) # data
Key skills
This chapter illustrates the basic table-manipulation tools (verbs) of dplyr. Key skills conveyed include transforming data by using essential dplyr commands:
- `filter` and `arrange` cases (rows);
- `select` and re-arrange variables (columns);
- computing and adding new variables (with `mutate` and `transmute`);
- computing counts and descriptives of group aggregates (by `group_by`, `summarise`, and aggregation functions);
- computing new group-level variables (by grouped mutates).
Example tasks
Task 6: R wars
Let’s tackle the universe with the tidyverse by uncovering some facts about the dplyr::starwars dataset. Answer the following questions by using pipes of basic dplyr commands (i.e., arranging, filtering, selecting, grouping, counting, summarizing).
6a. Save the tibble dplyr::starwars as sw and report its dimensions. (1 point)
6b. Missing values and known unknowns:
- How many missing (`NA`) values does `sw` contain? (1 point)
- Which individuals come from an unknown (missing) `homeworld` but have a known `birth_year` or known `mass`? (1 point)
6c. Gender issues:
- How many humans are contained in `sw` overall and by gender? (1 point)
- How many and which individuals in `sw` are neither male nor female? (1 point)
- Of which species in `sw` exist at least 2 different gender values? (1 point)
6d. Popular homes and heights:
- From which `homeworld` do the most individuals (rows) come? (1 point)
- What is the mean `height` of all individuals with orange eyes from the most popular homeworld? (1 point)
6e. Size and mass issues:
- Compute the median, mean, and standard deviation of `height` for all droids. (1 point)
- Compute the average height and mass by species and save the result as `h_m`. (1 point)
- Sort `h_m` to list the 3 species with the smallest individuals (in terms of mean height). (1 point)
- Sort `h_m` to list the 3 species with the heaviest individuals (in terms of median mass). (1 point)
# library(tidyverse)
# ?dplyr::starwars
## (a) Basic data properties: ----
sw <- dplyr::starwars
dim(sw) # => 87 rows (denoting individuals) x 13 columns (variables)
#> [1] 87 13
## Missing data: -----
## (+) How many missing data points?
sum(is.na(sw)) # => 101 missing values.
#> [1] 101
# (+) Which individuals come from an unknown (missing) homeworld
# but have a known birth_year or mass?
sw %>%
filter(is.na(homeworld), !is.na(mass) | !is.na(birth_year))
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Yoda 66 17.0 white green brown 896 male
#> 2 IG-88 200 140 none metal red 15.0 none
#> 3 Qui-Gon … 193 89.0 brown fair blue 92.0 male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## (x) Which variable (column) has the most missing values?
colSums(is.na(sw)) # => birth_year has 44 missing values
#> name height mass hair_color skin_color eye_color
#> 0 6 28 5 0 0
#> birth_year gender homeworld species films vehicles
#> 44 3 10 5 0 0
#> starships
#> 0
colMeans(is.na(sw)) # => birth_year is missing in about 50.6% of all cases.
#> name height mass hair_color skin_color eye_color
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000
#> birth_year gender homeworld species films vehicles
#> 0.50574713 0.03448276 0.11494253 0.05747126 0.00000000 0.00000000
#> starships
#> 0.00000000
## (x) Replace all missing values of `hair_color` (in the variable `sw$hair_color`) by "bald":
# sw$hair_color[is.na(sw$hair_color)] <- "bald"
## (c) Gender issues: -----
# (+) How many humans are there of each gender?
sw %>%
filter(species == "Human") %>%
group_by(gender) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: gender [2]
#> gender n
#> <chr> <int>
#> 1 female 9
#> 2 male 26
## Answer: 35 humans in total: 9 female, 26 male.
# (+) How many and which individuals are neither male nor female?
sw %>%
filter(gender != "male", gender != "female")
#> # A tibble: 3 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabba D… 175 1358 <NA> green-tan,… orange 600 herma…
#> 2 IG-88 200 140 none metal red 15.0 none
#> 3 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) Of which species are there at least 2 different gender values?
sw %>%
group_by(species, gender) %>%
count() %>% # table shows species by gender:
group_by(species) %>% # Which species appear more than once in this table?
count() %>%
filter(nn > 1)
#> # A tibble: 5 x 2
#> # Groups: species [5]
#> species nn
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
# alternative (and shorter) solution:
sw %>%
group_by(species)%>%
summarise(n_gender_vals = n_distinct(gender)) %>%
filter(n_gender_vals >= 2)
#> # A tibble: 5 x 2
#> species n_gender_vals
#> <chr> <int>
#> 1 Droid 2
#> 2 Human 2
#> 3 Kaminoan 2
#> 4 Twi'lek 2
#> 5 <NA> 2
## (d) Homeworld issues: -----
# (+) Popular homes: From which homeworld do the most individuals (rows) come?
sw %>%
group_by(homeworld) %>%
count() %>%
arrange(desc(n))
#> # A tibble: 49 x 2
#> # Groups: homeworld [49]
#> homeworld n
#> <chr> <int>
#> 1 Naboo 11
#> 2 Tatooine 10
#> 3 <NA> 10
#> 4 Alderaan 3
#> 5 Coruscant 3
#> 6 Kamino 3
#> 7 Corellia 2
#> 8 Kashyyyk 2
#> 9 Mirial 2
#> 10 Ryloth 2
#> # ... with 39 more rows
# => Naboo (with 11 individuals)
# (+) What is the mean height of all individuals with orange eyes from the most popular homeworld?
sw %>%
filter(homeworld == "Naboo", eye_color == "orange") %>%
summarise(n = n(),
mn_height = mean(height))
#> # A tibble: 1 x 2
#> n mn_height
#> <int> <dbl>
#> 1 3 209
## Note:
sw %>%
filter(eye_color == "orange") # => 8 individuals
#> # A tibble: 8 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabba … 175 1358 <NA> green-tan,… orange 600 herma…
#> 2 Ackbar 180 83.0 none brown mott… orange 41.0 male
#> 3 Jar Ja… 196 66.0 none orange orange 52.0 male
#> 4 Roos T… 224 82.0 none grey orange NA male
#> 5 Rugor … 206 NA none green orange NA male
#> 6 Sebulba 112 40.0 none grey, red orange NA male
#> 7 Ben Qu… 163 65.0 none grey, gree… orange NA male
#> 8 Saesee… 188 NA none pale orange NA male
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
# (+) What is the mass and homeworld of the smallest droid?
sw %>%
filter(species == "Droid") %>%
arrange(height)
#> # A tibble: 5 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 R2-D2 96 32.0 <NA> white, blue red 33.0 <NA>
#> 2 R5-D4 97 32.0 <NA> white, red red NA <NA>
#> 3 C-3PO 167 75.0 <NA> gold yellow 112 <NA>
#> 4 IG-88 200 140 none metal red 15.0 none
#> 5 BB8 NA NA none none black NA none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
## (4) Group summaries: -----
# (+) Compute the median, mean, and standard deviation of `height` for all droids.
sw %>%
filter(species == "Droid") %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
md_height = median(height, na.rm = TRUE),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
#> # A tibble: 1 x 5
#> n not_NA_h md_height mn_height sd_height
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 5 4 132 140 52.0
# (+) Compute the average height and mass by species and save the result as `h_m`:
h_m <- sw %>%
group_by(species) %>%
summarise(n = n(),
not_NA_h = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
not_NA_m = sum(!is.na(mass)),
md_mass = median(mass, na.rm = TRUE)
)
h_m
#> # A tibble: 38 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Aleena 1 1 79.0 1 15.0
#> 2 Besalisk 1 1 198 1 102
#> 3 Cerean 1 1 198 1 82.0
#> 4 Chagrian 1 1 196 0 NA
#> 5 Clawdite 1 1 168 1 55.0
#> 6 Droid 5 4 140 4 53.5
#> 7 Dug 1 1 112 1 40.0
#> 8 Ewok 1 1 88.0 1 20.0
#> 9 Geonosian 1 1 183 1 80.0
#> 10 Gungan 3 3 209 2 74.0
#> # ... with 28 more rows
# (+) Use `h_m` to list the 3 species with the smallest individuals (in terms of mean height)?
h_m %>% arrange(mn_height) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Yoda's species 1 1 66.0 1 17.0
#> 2 Aleena 1 1 79.0 1 15.0
#> 3 Ewok 1 1 88.0 1 20.0
# (+) Use `h_m` to list the 3 species with the heaviest individuals (in terms of median mass)?
h_m %>% arrange(desc(md_mass)) %>% slice(1:3)
#> # A tibble: 3 x 6
#> species n not_NA_h mn_height not_NA_m md_mass
#> <chr> <int> <int> <dbl> <int> <dbl>
#> 1 Hutt 1 1 175 1 1358
#> 2 Kaleesh 1 1 216 1 159
#> 3 Wookiee 2 2 231 2 124
pt <- pt + 12 # increment point total
Task 7: Hot and wet flights
Use the dataset nycflights13::weather for questions that require filter, arrange, select, group_by, summarise (count, NAs, means, medians), etc.
7a. Save the tibble nycflights13::weather as wt and report its dimensions. (1 point)
7b. Missing values and known unknowns:
- How many missing (`NA`) values does `wt` contain? (1 point)
- What is the percentage of missing (`NA`) values in `wt`? (1 point)
- What is the range (i.e., minimum and maximum value) of the `year` variable? (1 point)
7c. How many observations (rows) does the data contain for each of the 3 airports (origin)? (1 point)
7d. Compute a new variable temp_dc that provides the temperature (in degrees Celsius) that corresponds to temp (in degrees Fahrenheit).
Hint: The formula for conversion from Fahrenheit (degrees F) to Celsius (degrees C) is: \(C = (F - 32) \times 5/9\).
Add your new temp_dc variable to a new dataset wt_2 and re-arrange its columns so that your new temp_dc variable appears next to temp. (2 points)
7e. When only considering “JFK” airport:
- What are the 3 (different) dates with the (a) coldest and (b) hottest temperatures at this airport?
Report the 3 dates and their extreme temperatures (in degrees Celsius) for (a) and (b). (2 points)
7f. Plot the amount of mean precipitation by month for each of the 3 airports (origin). (2 points)
Hint: First use dplyr to compute a table of means (by origin and month). Then use ggplot to draw a line or bar plot of the means.
7g. For each of the 3 airports:
When excluding extreme cases of precipitation (specifically, values of `precip` greater than 0.30):
- Does it rain more during winter months (Oct to Mar) or during summer months (Apr to Sep)? (2 points)
- Plot the total amount of precipitation in winter vs. summer for each airport. (2 points)
Hints: Use filter to remove cases of extreme precipitation and create a logical variable (e.g., `summer`) that is `TRUE` for summer months and `FALSE` for winter months. Then use dplyr to summarise the total amount (sum) of precipitation by `origin` and `summer`. The resulting tibble can be plotted as a bar chart (with different facets by `origin`).
library(nycflights13)
# nycflights13::weather
# ?weather
## How many observations (rows) and variables (columns) does the data set contain overall?
wt <- nycflights13::weather
dim(wt)
#> [1] 26130 15
## (b) missing values and ranges:
## - How many missing values does the `weather` data contain?
sum(is.na(weather)) # sum of NA values
#> [1] 3157
mean(is.na(weather)) # percentage
#> [1] 0.008054599
## - What is the range of values of the `year` variable?
range(weather$year)
#> [1] 2013 2013
## (c) How many observations (rows) does the data contain for each of the 3 airports (`origin`)?
weather %>%
group_by(origin) %>%
count()
#> # A tibble: 3 x 2
#> # Groups: origin [3]
#> origin n
#> <chr> <int>
#> 1 EWR 8708
#> 2 JFK 8711
#> 3 LGA 8711
## (d) Conversion from Fahrenheit to Celsius:
## Compute a variable `temp_dc` that provides the temperature (in degrees Celsius)
## that corresponds to `temp` (in degrees Fahrenheit).
## Fahrenheit (degrees F) to Celsius (degrees C) conversion:
## C = (F - 32) x 5/9.
## Add your new `temp_dc` variable to a new dataset `wt_2` and
## re-arrange its columns so that your `temp_dc` variable appears next to `temp`.
wt_2 <- wt %>%
mutate(temp_dc = (temp - 32) * 5/9) %>%
select(origin:temp, temp_dc, everything())
wt_2
#> # A tibble: 26,130 x 16
#> origin year month day hour temp temp_dc dewp humid wind_dir
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1.00 1 0 37.0 2.80 21.9 54.0 230
#> 2 EWR 2013 1.00 1 1 37.0 2.80 21.9 54.0 230
#> 3 EWR 2013 1.00 1 2 37.9 3.30 21.9 52.1 230
#> 4 EWR 2013 1.00 1 3 37.9 3.30 23.0 54.5 230
#> 5 EWR 2013 1.00 1 4 37.9 3.30 24.1 57.0 240
#> 6 EWR 2013 1.00 1 6 39.0 3.90 26.1 59.4 270
#> 7 EWR 2013 1.00 1 7 39.0 3.90 27.0 61.6 250
#> 8 EWR 2013 1.00 1 8 39.0 3.90 28.0 64.4 240
#> 9 EWR 2013 1.00 1 9 39.9 4.40 28.0 62.2 250
#> 10 EWR 2013 1.00 1 10 39.0 3.90 28.0 64.4 260
#> # ... with 26,120 more rows, and 6 more variables: wind_speed <dbl>,
#> # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> # <dttm>
## (e) Only considering "JFK" airport:
## What are the 3 (different) dates with the (a) coldest and (b) hottest temperatures there?
## Report the 3 dates and their extreme temperatures (in degrees Celsius) for (a) and (b).
JFK_temp <- wt_2 %>%
filter(origin == "JFK") %>%
arrange(temp_dc)
JFK_temp # => coldest days
#> # A tibble: 8,711 x 16
#> origin year month day hour temp temp_dc dewp humid wind_dir
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 JFK 2013 1.00 23 9 12.0 -11.1 - 7.06 41.3 280
#> 2 JFK 2013 1.00 23 10 12.0 -11.1 - 5.08 45.4 280
#> 3 JFK 2013 1.00 23 11 12.0 -11.1 - 5.08 45.4 290
#> 4 JFK 2013 1.00 23 6 12.9 -10.6 - 9.04 36.0 290
#> 5 JFK 2013 1.00 23 7 12.9 -10.6 - 7.96 38.0 290
#> 6 JFK 2013 1.00 23 8 12.9 -10.6 - 7.06 39.7 290
#> 7 JFK 2013 1.00 23 12 12.9 -10.6 - 4.00 46.0 280
#> 8 JFK 2013 1.00 24 6 12.9 -10.6 - 2.02 50.5 40.0
#> 9 JFK 2013 5.00 9 2 13.1 -10.5 12.0 95.3 80.0
#> 10 JFK 2013 1.00 23 4 14.0 -10.0 - 9.94 32.9 300
#> # ... with 8,701 more rows, and 6 more variables: wind_speed <dbl>,
#> # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> # <dttm>
JFK_temp %>% arrange(desc(temp)) # => hottest days
#> # A tibble: 8,711 x 16
#> origin year month day hour temp temp_dc dewp humid wind_dir
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 JFK 2013 7.00 18 16 98.1 36.7 66.9 36.4 260
#> 2 JFK 2013 7.00 18 15 97.0 36.1 68.0 39.0 330
#> 3 JFK 2013 7.00 18 18 97.0 36.1 71.1 43.4 220
#> 4 JFK 2013 7.00 16 18 96.1 35.6 62.1 32.6 360
#> 5 JFK 2013 7.00 18 14 96.1 35.6 66.9 38.7 350
#> 6 JFK 2013 7.00 18 17 96.1 35.6 71.1 44.6 250
#> 7 JFK 2013 7.00 15 20 95.0 35.0 68.0 41.5 360
#> 8 JFK 2013 7.00 16 17 95.0 35.0 60.1 31.4 20.0
#> 9 JFK 2013 7.00 17 15 95.0 35.0 64.9 37.3 350
#> 10 JFK 2013 7.00 20 20 95.0 35.0 69.1 43.1 260
#> # ... with 8,701 more rows, and 6 more variables: wind_speed <dbl>,
#> # wind_gust <dbl>, precip <dbl>, pressure <dbl>, visib <dbl>, time_hour
#> # <dttm>
## Aggregation examples: -----
## (x) Average temperature per month:
## (used in class):
mn_temp_month <- wt_2 %>%
# group_by(origin, month) %>%
group_by(month) %>%
summarise(n = n(),
n_not_NA = sum(!is.na(temp_dc)),
mn_temp_dc = mean(temp_dc, na.rm = TRUE))
mn_temp_month
#> # A tibble: 12 x 4
#> month n n_not_NA mn_temp_dc
#> <dbl> <int> <int> <dbl>
#> 1 1.00 2229 2229 2.02
#> 2 2.00 2010 2010 1.20
#> 3 3.00 2227 2227 4.34
#> 4 4.00 2159 2159 10.9
#> 5 5.00 2232 2232 16.4
#> 6 6.00 2160 2160 22.3
#> 7 7.00 2228 2228 26.7
#> 8 8.00 2217 2216 23.6
#> 9 9.00 2159 2159 19.7
#> 10 10.0 2212 2212 15.6
#> 11 11.0 2138 2138 7.28
#> 12 12.0 2159 2159 3.54
ggplot(mn_temp_month, aes(x = month, y = mn_temp_dc)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = 1:12) +
theme_bw()
## (f) Plot the amount of mean precipitation (by origin and month):
wt %>%
group_by(origin, month) %>%
summarise(n = n(),
n_not_NA = sum(!is.na(precip)),
mn_precip = mean(precip, na.rm = TRUE)) %>%
ggplot(aes(x = month, y = mn_precip, color = origin, shape = origin)) +
geom_point(size = 2) +
geom_line(size = 1) +
# geom_bar(aes(fill = origin), stat = "identity", position = "dodge") +
scale_x_continuous(breaks = 1:12) +
labs(title = "Mean precipitation by month and origin",
x = "Month", y = "Mean precipitation",
caption = "[Data from nycflights13::weather]") +
theme_bw()
## Computation and visualization to answer a question: ----
## (g) For each of the 3 airports:
## When excluding extreme cases of precipitation (values of `precip` greater than 0.30):
## Does it rain more during winter (Oct to Mar) or during summer (Apr to Sep) months?
## Plot the total amount of precipitation in winter vs. summer for each airport.
## Inspect data:
wt # month is given numerically!
#> # A tibble: 26,130 x 15
#> origin year month day hour temp dewp humid wind_dir wind_speed
#> <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1.00 1 0 37.0 21.9 54.0 230 10.4
#> 2 EWR 2013 1.00 1 1 37.0 21.9 54.0 230 13.8
#> 3 EWR 2013 1.00 1 2 37.9 21.9 52.1 230 12.7
#> 4 EWR 2013 1.00 1 3 37.9 23.0 54.5 230 13.8
#> 5 EWR 2013 1.00 1 4 37.9 24.1 57.0 240 15.0
#> 6 EWR 2013 1.00 1 6 39.0 26.1 59.4 270 10.4
#> 7 EWR 2013 1.00 1 7 39.0 27.0 61.6 250 8.06
#> 8 EWR 2013 1.00 1 8 39.0 28.0 64.4 240 11.5
#> 9 EWR 2013 1.00 1 9 39.9 28.0 62.2 250 12.7
#> 10 EWR 2013 1.00 1 10 39.0 28.0 64.4 260 12.7
#> # ... with 26,120 more rows, and 5 more variables: wind_gust <dbl>, precip
#> # <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
# ggplot(weather, aes(x = precip)) +
# geom_histogram(binwidth = 0.01, fill = seeblau)
## Preparation: Filter out extreme values and add a summer variable:
weather_season <- wt %>%
filter(precip <= .30) %>%
mutate(summer = (month > 3 & month < 10)#,
# winter = (month < 4 | month > 9)
)
## Computation: Sum of precipitation by origin and summer season:
sum_rain_season <- weather_season %>%
group_by(origin, summer) %>%
summarise(n = n(),
# n_not_NA = sum(!is.na(precip)),
# mn_precip = mean(precip, na.rm = TRUE),
sum_precip = sum(precip, na.rm = TRUE))
sum_rain_season
#> # A tibble: 6 x 4
#> # Groups: origin [?]
#> origin summer n sum_precip
#> <chr> <lgl> <int> <dbl>
#> 1 EWR F 4320 10.9
#> 2 EWR T 4380 11.9
#> 3 JFK F 4323 10.3
#> 4 JFK T 4379 10.3
#> 5 LGA F 4323 10.0
#> 6 LGA T 4384 9.21
## Visualization: Sum of precipitation as a bar chart:
ggplot(weather_season, aes(x = summer, fill = summer)) +
facet_wrap(~origin) +
geom_bar(aes(weight = precip), na.rm = TRUE) +
scale_fill_manual(name = "Summer:", values = c("steelblue3", "firebrick")) +
labs(title = "Sum of precipitation in winter vs. summer months",
x = "Summer", y = "Sum of precipitation",
caption = "[Data from nycflights13::weather]") +
theme_bw()
pt <- pt + 15 # increment point total

Task 8: Not all outliers are alike
This task examines the statistical definition of outliers and uses a generated dataset (entitled out.csv and available at http://rpository.com/ds4psy/mt/out.csv). Use the following read_csv() command to obtain and load it into R:
## Load data (as comma-separated file):
data <- read_csv("http://rpository.com/ds4psy/mt/out.csv") # from online source
## Alternatively (from local source):
# data <- read_csv("out.csv") # from current directory

An outlier can be defined as an individual whose value on some metric deviates by more than a given criterion (e.g., 2 standard deviations) from the mean. But this definition is incomplete unless it also specifies an appropriate reference group. This task explores the implications of different reference groups.
8a. Save the data into a tibble data and report its number of observations and variables. (1 point)
8b. How many missing data values are there in data? (1 point)
8c. What is the gender (or sex) distribution in this sample? (1 point)
8d. Create a plot that shows the distribution of height values for each gender. (1 point)
8e. Compute 2 new variables that signal and distinguish between 2 types of outliers in terms of height:
- outliers relative to the `height` of the overall sample (i.e., individuals with `height` values deviating more than 2 SD from the overall mean of `height`) (1 point);
- outliers relative to the `height` of some subgroup's mean and SD. Here, a suitable subgroup to consider is every person's gender (i.e., individuals with `height` values deviating more than 2 SD from the mean `height` of their own gender). (1 point)
Hints: As both variables signal whether or not someone is an outlier, they should be defined as logicals (being either TRUE or FALSE) and added as new columns to data (via appropriate mutate commands). While the 1st variable can be computed from the mean and SD of the overall sample, the 2nd variable can be computed after grouping data by gender and then computing and using the corresponding mean and SD values. The absolute difference between 2 numeric values x and y is given by abs(x - y).
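As a minimal sketch of this hint (in base R, with made-up toy values rather than the out.csv data), both outlier flags can be computed from means and SDs; `ave()` yields group-wise means and SDs without needing dplyr:

```r
## Toy data (hypothetical; the actual task uses out.csv):
height <- c(165, 166, 164, 165, 166, 164, 165, 166, 164, 180,
            180, 181, 179, 180, 181, 179, 180, 181, 179, 180)
sex <- rep(c("f", "m"), each = 10)

## 1. Outlier relative to the overall sample:
out_overall <- abs(height - mean(height)) > 2 * sd(height)

## 2. Outlier relative to one's own group:
## ave() repeats each group's mean/SD for every element of that group:
grp_mn <- ave(height, sex, FUN = mean)
grp_sd <- ave(height, sex, FUN = sd)
out_group <- abs(height - grp_mn) > 2 * grp_sd

sum(out_overall) # => 0: nobody deviates > 2 SD from the overall mean
sum(out_group)   # => 1: the 180 cm person in group "f" is a group-wise outlier
```

In the task itself, the same logic runs inside mutate() (for the overall sample) and inside group_by(sex) followed by mutate() (for the group-wise flags).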
8f. Use the 2 new variables to define and identify 2 subgroups of people:
- `out_1`: Individuals (females and males) with `height` values that are outliers relative to both the entire sample and the sample of their own gender. How many such individuals are in `data`? (1 point)
- `out_2`: Individuals (females and males) with `height` values that are not outliers relative to the entire population, but are outliers relative to their own gender. How many such individuals are in `data`? (1 point)
8g. Bonus task: Visualize the raw values and distributions of height for both types of outliers (out_1 and out_2) in 2 separate plots and describe the height and sex combination of the individuals shown in each plot. (2 bonus points)
## (a) Load and inspect data:
# data <- read_csv("out.csv") # read in csv-file
# data <- as_tibble(data) # if data is not already a tibble
dim(data) # => 1000 observations (rows) x 3 variables (columns)
#> [1] 1000 3
## (b) Missing data points:
sum(is.na(data)) # => 18 missing values
#> [1] 18
## (c) Gender distribution:
data %>%
group_by(sex) %>%
count()
#> # A tibble: 2 x 2
#> # Groups: sex [2]
#> sex n
#> <chr> <int>
#> 1 female 507
#> 2 male 493
# => 50.7% females, 49.3% males.
## (d) Distributions of `height` as density plot:
ggplot(data, aes(x = height)) +
geom_density(fill = "gold", alpha = 2/3) +
geom_density(aes(fill = sex), alpha = 2/5) +
labs(title = "Distribution of heights overall and by gender",
fill = "Gender") +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
theme_bw()
# Note: To avoid the Warning about removing 18 cases with NA-values,
# we could first filter out those cases:
# non_NA_data <- filter(data, !is.na(height))
## Alternative solution as 2 histograms:
ggplot(data) +
facet_wrap(~sex) +
geom_histogram(aes(x = height, fill = sex), binwidth = 5, color = "grey10") +
labs(title = "Distribution of heights by gender",
fill = "Gender") +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
theme_bw()
## (+) Included in (e), but also possible to do separately:
## Compute the number, means and SD of height values in 2 ways:
{
## 1. overall:
data %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
## 2. by gender:
data %>%
group_by(sex) %>%
summarise(n = n(),
n_not_NA = sum(!is.na(height)),
mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE))
}
#> # A tibble: 2 x 5
#> sex n n_not_NA mn_height sd_height
#> <chr> <int> <int> <dbl> <dbl>
#> 1 female 507 501 169.0679 8.193619
#> 2 male 493 481 180.5676 11.026677
## (e) Detecting and marking outliers (by logical variables): -----
## Compute the means, SDs, and corresponding outliers in 2 ways:
crit <- 2 # criterion value for detecting outliers (in SD units)
data_out <- data %>%
# 1. Compute means, SD, and outliers for overall sample:
mutate(mn_height = mean(height, na.rm = TRUE),
sd_height = sd(height, na.rm = TRUE),
out_height = abs(height - mn_height) > (crit * sd_height)) %>%
group_by(sex) %>%
# 2. Compute same metrics for subgroups (by sex):
mutate(mn_sex_height = mean(height, na.rm = TRUE),
sd_sex_height = sd(height, na.rm = TRUE),
out_sex_height = abs(height - mn_sex_height) > (crit * sd_sex_height))
# data_out
## (f) Identify 2 types of outliers: -----
## 1. Outliers relative to the entire population AND to their own gender:
out_1 <- data_out %>%
filter(out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_1) # => 21 individuals.
#> [1] 21
## 2. Outliers relative to their own gender, but NOT relative to the entire population:
out_2 <- data_out %>%
filter(!out_height & out_sex_height) %>%
arrange(sex, height)
nrow(out_2) # => 24 individuals.
#> [1] 24
## (g) Bonus task:
## Visualization and interpretation of both types of outliers: -----
## 1. Showing out_1:
ggplot(out_1, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to both overall sample and gender",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_1` contains mostly short women (except for 1 tall woman)
# and mostly tall men (except for 2 short men).
## 2. Showing out_2:
ggplot(out_2, aes(x = sex, y = height)) +
geom_violin(aes(fill = sex)) +
geom_jitter(size = 4, alpha = 2/3) +
scale_fill_manual(values = c("firebrick", "steelblue3")) +
labs(title = "Outliers relative to gender but not overall sample",
x = "Gender", y = "Height (in cm)",
fill = "Gender:") +
theme_bw()
# Interpretation:
# `out_2` contains individuals who are either tall women or short men.
pt <- pt + (8 + 0) # increment point total (leaving out 2 bonus points)

Task 9: The length of flights
Use the nycflights13::flights dataset to compute the actual duration of flights:
- Compute a variable `true_duration` as the duration of each flight (in minutes) from its `dep_time` and `arr_time`.
- How does it relate to the `air_time` variable in the dataset? (Plot the relationship between both variables.)
library(nycflights13) # loads flights dataset
compute_duration <- function(dep_min, arr_min) {
dur <- NA # initialize
# Distinguish between 2 cases:
if (dep_min < arr_min) { # dep before arr (i.e., same day):
dur <- arr_min - dep_min
} else { # dep later than arr (i.e., different days):
dur <- (arr_min + 24 * 60) - dep_min
}
return(dur)
}
# Check for vectors:
dep <- seq(0, 24*60, by = +15)
arr <- seq(24*60, 0, by = -15)
# compute_duration(dep, arr)
# Question: How to apply a function to each pair of values from 2 vectors/columns? (e.g., mapply() or purrr::map2_dbl())
d <- flights %>%
filter(origin == "JFK") %>% # to reduce size of dataset
select(dep_time, arr_time, air_time) %>%
mutate(dep_time_min = ((dep_time %/% 100) * 60) + (dep_time %% 100),
arr_time_min = ((arr_time %/% 100) * 60) + (arr_time %% 100),
# true_duration = compute_duration(dep_time_min, arr_time_min),
true_duration = arr_time_min - dep_time_min) %>%
filter(air_time > 0, true_duration > 0)
p <- ggplot(d, (aes(x = air_time, y = true_duration))) +
geom_point(alpha = 1/4) +
geom_abline(intercept = 0, slope = 1, linetype = 2, size = 1, color = "red3") +
theme_bw()

Notes
- The `dep_time` and `arr_time` variables are specified in terms of hours and minutes. As an hour contains 60 minutes (rather than 100), we first use modular arithmetic to transform both variables into a metric of minutes (see the variables `dep_time_min` and `arr_time_min`).
- For flights departing and arriving on the same day, we can simply subtract `dep_time_min` from `arr_time_min` to obtain `true_duration`. However, if a flight departs before and arrives after midnight (i.e., arrives on the next day), this measure would yield a negative result. To correct for this, we need to add 24 hours (24 times 60 minutes) to `arr_time` whenever `arr_time` is before (i.e., smaller than) `dep_time`. [This assumes that no flights exceed 24 hours.]
- This seems too difficult at this stage, as it requires conditional execution (for flights departing and arriving on the same vs. different days) and/or functions.
In chapter 16 (Dates and times: http://r4ds.had.co.nz/dates-and-times.html), this problem is solved by introducing a Boolean variable for overnight flights and re-computing air_time by subtracting date-time objects (as intervals).
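Alternatively, the overnight correction can be vectorized with ifelse(), avoiding both loops and the scalar if/else function. A minimal base R sketch with made-up HHMM times:

```r
## Convert HHMM clock times (e.g., 2130 = 21:30) into minutes since midnight:
hhmm_to_min <- function(t) (t %/% 100) * 60 + (t %% 100)

dep <- c(2130, 2300, 800) # hypothetical departure times (HHMM)
arr <- c(15, 130, 1030)   # hypothetical arrival times (HHMM)

dep_min <- hhmm_to_min(dep)
arr_min <- hhmm_to_min(arr)

## Add 24 h (in minutes) whenever arrival lies before departure (overnight flight):
duration <- ifelse(arr_min < dep_min,
                   (arr_min + 24 * 60) - dep_min,
                   arr_min - dep_min)
duration # => 165 150 150 (minutes)
```

Because ifelse() is vectorized, this handles whole columns at once; applying the scalar compute_duration() from above pairwise would instead require mapply(compute_duration, dep_min, arr_min).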
7: Exploratory Data Analysis
Key skills
Transform and visualize datasets to:
- detect and deal with missing (NA) values;
- view and interpret distributions of variables (e.g., to see patterns and deal with unusual or extreme values);
- view and interpret relationships between (categorical/continuous) variables.
Example tasks
Additional examples of tasks are contained in the other chapters (see tasks on “multiple chapters” below).
Task 10: Cars
Using the mtcars dataset:
11. The mtcars data (contained in R datasets) contains …
## Data to use:
# ?datasets::mtcars
# mtcars
## (0) Convert into a tibble:
df <- as_tibble(rownames_to_column(mtcars, var = "model"))
df$cyl <- factor(df$cyl)
df$am <- factor(df$am)
df
#> # A tibble: 32 x 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
#> 1 Mazd… 21.0 6 160 110 3.90 2.62 16.5 0 1 4.00 4.00
#> 2 Mazd… 21.0 6 160 110 3.90 2.88 17.0 0 1 4.00 4.00
#> 3 Dats… 22.8 4 108 93.0 3.85 2.32 18.6 1.00 1 4.00 1.00
#> 4 Horn… 21.4 6 258 110 3.08 3.22 19.4 1.00 0 3.00 1.00
#> 5 Horn… 18.7 8 360 175 3.15 3.44 17.0 0 0 3.00 2.00
#> 6 Vali… 18.1 6 225 105 2.76 3.46 20.2 1.00 0 3.00 1.00
#> 7 Dust… 14.3 8 360 245 3.21 3.57 15.8 0 0 3.00 4.00
#> 8 Merc… 24.4 4 147 62.0 3.69 3.19 20.0 1.00 0 4.00 2.00
#> 9 Merc… 22.8 4 141 95.0 3.92 3.15 22.9 1.00 0 4.00 2.00
#> 10 Merc… 19.2 6 168 123 3.92 3.44 18.3 1.00 0 4.00 4.00
#> # ... with 22 more rows
## (1) Distribution of mpg: ----
range(df$mpg)
#> [1] 10.4 33.9
ggplot(df, aes(x = mpg)) +
geom_histogram(binwidth = 5)
## (2) Group means and boxplot: Mean mpg by cylinder: ----
df %>%
group_by(cyl) %>%
summarise(n = n(),
md_mpg = median(mpg),
mn_mpg = mean(mpg)
)
#> # A tibble: 3 x 4
#> cyl n md_mpg mn_mpg
#> <fct> <int> <dbl> <dbl>
#> 1 4 11 26.0 26.7
#> 2 6 7 19.7 19.7
#> 3 8 14 15.2 15.1
ggplot(df, aes(x = cyl, y = mpg, color = cyl)) +
geom_boxplot() +
# geom_violin() +
geom_jitter()
## (3) Scatterplot: ----
## Show value of `disp` (on y-axis) by `hp` (on x-axis), grouped by `am`:
ggplot(df, aes(x = hp, y = disp, fill = am)) +
geom_point(shape = 21, size = 2.5) +
# geom_smooth() +
geom_text(aes(label = model), size = 2.5, vjust = 0, hjust = 0, nudge_x = 5) +
theme_bw()
## Alternative (using facets):
ggplot(df, aes(x = hp, y = disp)) +
facet_wrap(~am) +
geom_point(size = 2.5) +
# geom_smooth() +
geom_text(aes(label = model), size = 2.5, vjust = 0, hjust = 0, nudge_x = 5) +
theme_bw()
## Conclusions: ----
# - There is a positive correlation between horsepower and displacement
# (for both types of transmission).
# - outliers: Ford Pantera L and Maserati
# have more hp and disp than other cars with manual transmission.

Tasks involved
- Plot variable distributions (histogram);
- Compute descriptive group measures (e.g., counts, means, SDs);
- Plot raw data and group means (boxplot);
- Plot a scatterplot of 2 continuous variables;
- Interpret tables and graphs and draw conclusions.
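The positive-correlation conclusion drawn above can also be double-checked numerically. A quick base R sketch (mtcars ships with R, so this is self-contained):

```r
## Overall correlation between horsepower and displacement:
cor(mtcars$hp, mtcars$disp) # strongly positive

## Correlation within each transmission type (am: 0 = automatic, 1 = manual):
by(mtcars, mtcars$am, function(d) cor(d$hp, d$disp))
```

A scatterplot shows the shape of the relationship; cor() quantifies its (linear) strength and direction.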
10: Tibbles
Key skills
Turn data into tibbles:
- convert data frames into tibbles (by `as_tibble`);
- create tibbles from tabular data (by `tibble` and `tribble`).
Example tasks
The commands to create and convert to tibbles are included in the tasks for other chapters.
12: Tidy Data
Key skills
- Identify and create “tidy” datasets.
- Wrangle data to:
  - `gather` (wider) datasets into longer ones;
  - `spread` (longer) datasets into wider ones;
  - `separate` and `unite` variables (columns).
Example tasks
Task 12: Taking stocks
The following table shows the start and end price of 3 stocks on 3 days (d1, d2, d3):
| stock | d1_start | d1_end | d2_start | d2_end | d3_start | d3_end |
|---|---|---|---|---|---|---|
| Amada | 2.5 | 3.6 | 3.5 | 4.2 | 4.4 | 2.8 |
| Betix | 3.3 | 2.9 | 3.0 | 2.1 | 2.3 | 2.5 |
| Cevis | 4.2 | 4.8 | 4.6 | 3.1 | 3.2 | 3.7 |
12a. Create a tibble st that contains this data in this (wide) format. (1 point)
12b. Transform st into a longer table st_long that contains 18 rows and only 1 numeric variable for all stock prices. Adjust this table so that the day and time appear as 2 separate columns. (2 points)
12c. Create a graph that shows the 3 stocks’ end prices (on the y-axis) over the 3 days (on the x-axis). (1 point)
12d. Spread st_long into a wider table that contains start and end prices as 2 distinct variables (columns) for each stock and day. (1 point)
# library(tidyverse)
## (a) Enter stock data (in wide format) as a tibble:
st <- tribble(
~stock, ~d1_start, ~d1_end, ~d2_start, ~d2_end, ~d3_start, ~d3_end,
#-----|----------|--------|----------|--------|----------|--------|
"Amada", 2.5, 3.6, 3.5, 4.2, 4.4, 2.8,
"Betix", 3.3, 2.9, 3.0, 2.1, 2.3, 2.5,
"Cevis", 4.2, 4.8, 4.6, 3.1, 3.2, 3.7
)
dim(st)
#> [1] 3 7
## Note data structure:
## 2 nested factors: day (1 to 3), type (start or end).
## (b) Change from wide to long format
## that contains the day (d1, d2, d3) and type (start vs. end) as separate columns:
st_long <- st %>%
gather(d1_start:d3_end, key = "key", value = "val") %>%
separate(key, into = c("day", "time")) %>%
arrange(stock, day, time) # optional: arrange rows
st_long
#> # A tibble: 18 x 4
#> stock day time val
#> <chr> <chr> <chr> <dbl>
#> 1 Amada d1 end 3.6
#> 2 Amada d1 start 2.5
#> 3 Amada d2 end 4.2
#> 4 Amada d2 start 3.5
#> 5 Amada d3 end 2.8
#> 6 Amada d3 start 4.4
#> 7 Betix d1 end 2.9
#> 8 Betix d1 start 3.3
#> 9 Betix d2 end 2.1
#> 10 Betix d2 start 3.0
#> 11 Betix d3 end 2.5
#> 12 Betix d3 start 2.3
#> 13 Cevis d1 end 4.8
#> 14 Cevis d1 start 4.2
#> 15 Cevis d2 end 3.1
#> 16 Cevis d2 start 4.6
#> 17 Cevis d3 end 3.7
#> 18 Cevis d3 start 3.2
## (c) Plot the end values (on the y-axis) of the 3 stocks over 3 days (x-axis):
st_long %>%
filter(time == "end") %>%
ggplot(aes(x = day, y = val, color = stock, shape = stock)) +
geom_point(size = 4) +
geom_line(aes(group = stock)) +
## Pimping plot:
labs(title = "End prices of stocks",
x = "Day", y = "End price",
shape = "Stock:", color = "Stock:") +
theme_bw()
## (d) Change st_long into a wider format that lists start and end as 2 distinct variables (columns):
st_long %>%
spread(key = time, value = val) %>%
mutate(day_nr = parse_integer(str_sub(day, 2, 2))) # optional: get day_nr as integer variable
#> # A tibble: 9 x 5
#> stock day end start day_nr
#> <chr> <chr> <dbl> <dbl> <int>
#> 1 Amada d1 3.6 2.5 1
#> 2 Amada d2 4.2 3.5 2
#> 3 Amada d3 2.8 4.4 3
#> 4 Betix d1 2.9 3.3 1
#> 5 Betix d2 2.1 3.0 2
#> 6 Betix d3 2.5 2.3 3
#> 7 Cevis d1 4.8 4.2 1
#> 8 Cevis d2 3.1 4.6 2
#> 9 Cevis d3 3.7 3.2 3
pt <- pt + 5 # increment point total

Multiple chapters
Task 13: Flower power
The iris data (contained in R datasets) provides the measurements (in cm) of plant parts (length and width of sepal and petal parts) for 50 flowers from each of 3 species of iris (called setosa, versicolor, and virginica).
13a. Save datasets::iris as a tibble ir that contains this data and inspect it. Are there any missing values? (2 points)
13b. Compute a summary table that shows the means of the 4 measurement columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each of the 3 Species (in rows). Save the resulting table of means as a tibble im1 . (2 points)
13c. Create a histogram that shows the distribution of Sepal.Width values across all species. (1 point)
13d. Create a plot that shows the shape of the distribution of Sepal.Width values for each species. (1 point)
13e. Create a plot that shows Petal.Width as a function of Sepal.Width separately (i.e., in 3 facets) for each species. (1 point)
# ?iris
## (a) Turn into tibble and inspect: -----
ir <- as_tibble(datasets::iris)
dim(ir) # => 150 observations (rows) x 5 variables (columns)
#> [1] 150 5
sum(is.na(ir)) # => 0 missing values
#> [1] 0
## (b) Compute counts and means by species: -----
im1 <- ir %>%
group_by(Species) %>%
summarise(n = n(),
mn_sep.len = mean(Sepal.Length),
mn_sep.wid = mean(Sepal.Width),
mn_pet.len = mean(Petal.Length),
mn_pet.wid = mean(Petal.Width)
)
# Print im1 (as table with 4 variables of means):
knitr::kable(im1, caption = "Average iris measures (4 variables of mean values).")

| Species | n | mn_sep.len | mn_sep.wid | mn_pet.len | mn_pet.wid |
|---|---|---|---|---|---|
| setosa | 50 | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 50 | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | 50 | 6.588 | 2.974 | 5.552 | 2.026 |
## Graphical exploration: -----
## Distribution of 1 (continuous) variable:
## (c) Distribution of Sepal.Width across species:
ggplot(ir, aes(x = Sepal.Width)) +
geom_histogram(binwidth = .1, fill = "forestgreen") +
labs(title = "Distribution of sepal width across iris species") +
theme_bw()
## Distributions/relationships between 2 variables (1 categorical, 1 continuous):
## (d) The distributions of Sepal.Width by species:
ggplot(ir, aes(x = Species, y = Sepal.Width, fill = Species, shape = Species)) +
geom_violin() +
geom_point(size = 4, alpha = 1/2, position = "jitter") +
labs(title = "Distributions of sepal width by iris species") +
theme_bw()
## Alternative solution using density plots:
ggplot(ir, aes(x = Sepal.Width, fill = Species)) +
facet_wrap(~Species) +
geom_density() +
labs(title = "Distributions of sepal width by iris species") +
coord_fixed() +
theme_bw()
## Relationships between 2 variables (2 continuous):
## (e) Petal.Width as a function of Sepal.Width by iris species:
ggplot(ir, aes(x = Sepal.Width, y = Petal.Width, color = Species, shape = Species)) +
facet_wrap(~Species) +
geom_jitter(size = 3, alpha = 2/3) +
# geom_density2d() +
coord_fixed() +
labs(title = "Petal width as a function of sepal width by iris species") +
theme_bw()
pt <- pt + 7 # increment point total
## next: Turn iris df into long format using tidyr::gather -----
## (see below)

13f. Re-format your tibble ir into a tibble ir_long that is in "long format". (2 points)
Hint: Use tidyr::gather and separate to turn ir into a tibble that contains only 1 dependent variable for the value of measurements (e.g., val), but 2 categorical variables that specify the part (Sepal vs. Petal) and metric (Length vs. Width) of each observation.
13g. Use ir_long to recompute the subgroup means (for each combination of species, plant part, and metric) computed in 13b. Save the resulting table of means as a tibble im2 and verify that they have not changed from im1 above. (2 points)
13h. Visualize the relationships between the means of im2 (i.e., the mean measurements by plant part and metric) separately for each species. (2 points)
Hints: This task asks for showing the value of a continuous variable (the value of means) as a function of 2 categorical variables (the plant part and type of metric). Possible solutions could incorporate either geom_line or geom_tile and use different facets for different Species.
13i. Bonus task: Re-format your tibble ir_long (in long format) into a wider tibble ir_short that corresponds to the original ir dataset. (2 bonus points)
Hints: This task calls for applying tidyr::spread and unite commands to ir_long. However, spread will encounter an error unless every individual plant is identified by a unique variable (e.g., id number). This can be achieved by adding a numeric counter variable id (with values of rep(1:50, 3)) to ir before creating ir_long.
## Data (from above):
# ?iris
# ir <- as_tibble(datasets::iris)
# ir
## (f) Re-format ir into long format (using tidyr::gather) -----
ir_long <- ir %>%
mutate(id = rep(1:50, 3)) %>% # add a unique id to each plant [to enable (i) below]
gather(Sepal.Length:Petal.Width, key = "type", value = "val") %>%
separate("type", into = c("part", "metric"), sep = "\\.")
ir_long
#> # A tibble: 600 x 5
#> Species id part metric val
#> * <fctr> <int> <chr> <chr> <dbl>
#> 1 setosa 1 Sepal Length 5.1
#> 2 setosa 2 Sepal Length 4.9
#> 3 setosa 3 Sepal Length 4.7
#> 4 setosa 4 Sepal Length 4.6
#> 5 setosa 5 Sepal Length 5.0
#> 6 setosa 6 Sepal Length 5.4
#> 7 setosa 7 Sepal Length 4.6
#> 8 setosa 8 Sepal Length 5.0
#> 9 setosa 9 Sepal Length 4.4
#> 10 setosa 10 Sepal Length 4.9
#> # ... with 590 more rows
dim(ir_long) # => 600 rows x 5 columns
#> [1] 600 5
## (g) Recompute group counts and means from ir_long: -----
im2 <- ir_long %>%
group_by(Species, part, metric) %>%
summarise(n = n(),
mn_val = mean(val)
)
# Print im2 (as table with 4 variables of means):
knitr::kable(im2, caption = "Average iris measures (1 variable of mean values).")

| Species | part | metric | n | mn_val |
|---|---|---|---|---|
| setosa | Petal | Length | 50 | 1.462 |
| setosa | Petal | Width | 50 | 0.246 |
| setosa | Sepal | Length | 50 | 5.006 |
| setosa | Sepal | Width | 50 | 3.428 |
| versicolor | Petal | Length | 50 | 4.260 |
| versicolor | Petal | Width | 50 | 1.326 |
| versicolor | Sepal | Length | 50 | 5.936 |
| versicolor | Sepal | Width | 50 | 2.770 |
| virginica | Petal | Length | 50 | 5.552 |
| virginica | Petal | Width | 50 | 2.026 |
| virginica | Sepal | Length | 50 | 6.588 |
| virginica | Sepal | Width | 50 | 2.974 |
## Check: Compare sums of all means in im1 vs. im2:
sum1 <- sum(im1$mn_sep.len) + sum(im1$mn_sep.wid) +
sum(im1$mn_pet.len) + sum(im1$mn_pet.wid) # => 41.574
sum2 <- sum(im2$mn_val) # => 41.574
sum1 == sum2 # => TRUE (qed; note: for floating-point sums, all.equal(sum1, sum2) is the safer comparison)
#> [1] TRUE
## Showing the value of a continuous variable by 2 categorical variables: -----
## (h) Visualize the means of im2 (i.e., mean measurements by plant part and metric)
## as a line plot separately for each species:
ggplot(im2, aes(x = part, y = mn_val, group = metric, color = metric)) +
facet_wrap(~Species) +
geom_point(aes(shape = metric), size = 3) +
geom_line(aes(linetype = metric)) +
labs(title = "Mean petal and sepal lengths and widths by iris species") +
theme_bw()
## Alternative solution using tile plots:
ggplot(im2, aes(x = part, y = metric)) +
facet_wrap(~Species) +
geom_tile(aes(fill = mn_val)) +
geom_text(aes(label = mn_val), color = "white") +
labs(title = "Mean petal and sepal lengths and widths by iris species") +
theme_bw()
## Alternative: Using raw values (ir_long) to visualize the distributions
## of val as a function of Species, plant part, and metric:
ggplot(ir_long, aes(x = part, y = val, fill = Species)) +
facet_wrap(~Species) +
geom_violin() +
geom_jitter(aes(shape = metric), size = 2, alpha = 2/3) +
labs(title = "Mean petal and sepal lengths and widths by iris species") +
theme_bw()
## (i) Bonus task: -----
## See (f) above for the additional id-column for each plant:
## mutate(id = rep(1:50, 3)) %>% ...
## Re-format ir_long into shorter format (using tidyr::spread) -----
ir_short <- ir_long %>%
unite(type, part, metric, sep = ".") %>% # unite part and metric into "type" column
spread(key = type, value = val) %>% # spread "type" variable into multiple columns
arrange(Species, id)
ir_short
#> # A tibble: 150 x 6
#> Species id Petal.Length Petal.Width Sepal.Length Sepal.Width
#> <fctr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 1 1.4 0.2 5.1 3.5
#> 2 setosa 2 1.4 0.2 4.9 3.0
#> 3 setosa 3 1.3 0.2 4.7 3.2
#> 4 setosa 4 1.5 0.2 4.6 3.1
#> 5 setosa 5 1.4 0.2 5.0 3.6
#> 6 setosa 6 1.7 0.4 5.4 3.9
#> 7 setosa 7 1.4 0.3 4.6 3.4
#> 8 setosa 8 1.5 0.2 5.0 3.4
#> 9 setosa 9 1.4 0.2 4.4 2.9
#> 10 setosa 10 1.5 0.1 4.9 3.1
#> # ... with 140 more rows
## Verify identity of all measurement values in ir and ir_short:
all.equal(ir$Sepal.Length, ir_short$Sepal.Length) &
all.equal(ir$Sepal.Width, ir_short$Sepal.Width) &
all.equal(ir$Petal.Length, ir_short$Petal.Length) &
all.equal(ir$Petal.Width, ir_short$Petal.Width) # => TRUE (qed)
#> [1] TRUE
pt <- pt + (6 + 0) # increment point total (leaving out 2 bonus points)

Appendix
Creating data
Generating datasets with specific properties not yet used in the above tasks.
Create numeracy data
Create a (psychological) dataset numeracy that allows illustrating contents from all chapters.
#> # A tibble: 2 x 11
#> gender bnt_1_nNA bnt_1_mn bnt_2_nNA bnt_2_mn bnt_3_nNA bnt_3_mn
#> <fctr> <int> <dbl> <int> <dbl> <int> <dbl>
#> 1 female 531 0.569 531 0.701 531 0.454
#> 2 male 469 0.544 469 0.467 469 0.514
#> # ... with 4 more variables: bnt_4_nNA <int>, bnt_4_mn <dbl>,
#> # bnt_sum_nNA <int>, bnt_sum_mn <dbl>
#> [1] Sat Wed Mon Wed Sat Fri Wed Mon Sat Mon Fri Mon Mon Fri Wed Wed Wed
#> [18] Wed Sat Mon Tue Sat Sun Tue Sun Fri Sat Thu Sat Sat Mon Sat Thu Wed
#> [35] Thu Wed Tue Thu Mon Mon Sat Wed Fri Wed Sat Wed Wed Sat Thu Mon Wed
#> [52] Tue Sat Tue Tue Tue Mon Thu Thu Fri Thu Tue Fri Wed Thu Wed Fri Mon
#> [69] Fri Sun Mon Thu Mon Tue Mon
#> [ reached getOption("max.print") -- omitted 925 entries ]
#> Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
#> # A tibble: 2 x 4
#> gender n mn_g_iq mn_s_iq
#> <fctr> <int> <dbl> <dbl>
#> 1 female 531 103.0508 106.81921
#> 2 male 469 100.7591 96.57996
#> # A tibble: 5 x 4
#> bnt_sum n mn_g_iq mn_s_iq
#> <dbl> <int> <dbl> <dbl>
#> 1 0 59 98.66102 101.4576
#> 2 1 242 100.77686 100.4050
#> 3 2 406 99.83990 102.0542
#> 4 3 240 106.60417 103.5458
#> 5 4 53 106.54717 102.7925
#> # A tibble: 2 x 4
#> summerborn n mn_g_iq mn_s_iq
#> <lgl> <int> <dbl> <dbl>
#> 1 FALSE 488 99.58607 104.9508
#> 2 TRUE 512 104.25391 99.2207
#> # A tibble: 7 x 4
#> bweekdays n mn_g_iq mn_s_iq
#> <ord> <int> <dbl> <dbl>
#> 1 Sun 136 106.92647 101.8309
#> 2 Mon 152 101.52632 102.3092
#> 3 Tue 149 96.19463 102.3221
#> 4 Wed 133 101.78195 101.9248
#> 5 Thu 148 97.96622 101.3311
#> 6 Fri 139 101.65468 102.4532
#> 7 Sat 143 108.41259 101.9371
#> # A tibble: 2 x 4
#> weekend n mn_g_iq mn_s_iq
#> <lgl> <int> <dbl> <dbl>
#> 1 FALSE 721 99.7656 102.0680
#> 2 TRUE 279 107.6882 101.8853
#> # A tibble: 2 x 4
#> blood_pos n mn_g_iq mn_s_iq
#> <lgl> <int> <dbl> <dbl>
#> 1 FALSE 155 102.5290 93.89677
#> 2 TRUE 845 101.8746 103.50651
#> # A tibble: 1,000 x 6
#> bnt_1 bnt_2 bnt_3 bnt_4 g_iq s_iq
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 1 113 99
#> 2 1 1 1 NA 114 100
#> 3 0 1 0 0 108 87
#> 4 0 0 0 0 93 102
#> 5 1 0 0 0 114 83
#> 6 1 1 1 0 103 NA
#> 7 0 1 NA 0 110 85
#> 8 1 0 1 0 103 103
#> 9 0 0 0 0 107 101
#> 10 1 1 0 1 107 110
#> # ... with 990 more rows
#> # A tibble: 1,000 x 12
#> name gender bdate bweekday height blood_type bnt_1 bnt_2 bnt_3
#> <chr> <chr> <date> <chr> <int> <chr> <int> <int> <int>
#> 1 I.G. male 1968-12-14 Sat 169 O+ 1 0 0
#> 2 O.B. male 1974-04-10 Wed 181 O+ 1 1 1
#> 3 M.M. male 1987-09-28 Mon 183 A− 0 1 0
#> 4 V.J. female 1978-02-15 Wed 161 A+ 0 0 0
#> 5 O.E. male 1985-05-18 Sat 164 A− 1 0 0
#> 6 Q.W. male 1968-03-01 Fri 172 A+ 1 1 1
#> 7 H.K. male 1994-04-27 Wed 157 B− 0 1 NA
#> 8 T.R. female 1961-06-05 Mon 167 A+ 1 0 1
#> 9 F.J. male 1983-10-01 Sat 158 O+ 0 0 0
#> 10 J.R. female 1941-12-29 Mon 157 O+ 1 1 0
#> # ... with 990 more rows, and 3 more variables: bnt_4 <int>, g_iq <int>,
#> # s_iq <int>
Tasks
Basics:
- Report the data dimensions and number of missing values.
- Split the birth date (`bdate`) variable into 3 separate variables (`byear`, `bmonth`, and `bday`) that denote the year, month, and day of each participant's birth.
- Create a new variable `summerborn` that is `TRUE` when a person was born in summer (defined as April to September) and `FALSE` when a person was born in winter (October to March).
Assessing IVs:
- Compute the current `age` of each participant (in years, taking into account today's date).
- List the frequency of each blood type (`blood_type`) (a) overall and (b) by gender.
- Compute descriptives (the means and standard deviations) of `height` (a) by gender and (b) by cohort (decade of birth).
- Plot the distributions of heights by gender and cohort.
Assessing DVs:
- Compute an aggregate BNT score (a `BNT` value from 0 to 4 points) for each participant and check whether it varies systematically as a function of gender.
- Check the intelligence scores `g_iq` and `s_iq`:
  - How many missing values are there?
  - What are their means and ranges?
  - Plot their distributions in 2 histograms.
- Plot the relationship between `g_iq` and `s_iq`.
- Assess possible effects of numeracy (as measured by `BNT`) on the 2 types of intelligence (`g_iq` and `s_iq`).
- Assess possible effects of the variables
  - age (`age`),
  - birth season (`summerborn`),
  - blood type (`blood_type`), and
  - the weekday on which a person was born (`bweekday`)
  on the 2 types of intelligence (`g_iq` and `s_iq`) in 2 ways:
  a. numerically (by computing group means) and
  b. graphically (by plotting means and/or distributions).
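A few of the basic steps above can be sketched in base R (with hypothetical birth dates, since the numeracy data are generated separately):

```r
bdate <- as.Date(c("1968-12-14", "1978-02-15", "1994-04-27")) # hypothetical

## Split bdate into year, month, and day components:
byear  <- as.integer(format(bdate, "%Y"))
bmonth <- as.integer(format(bdate, "%m"))
bday   <- as.integer(format(bdate, "%d"))

## summerborn: TRUE for birth months April (4) to September (9):
summerborn <- bmonth >= 4 & bmonth <= 9

## Approximate current age in years (relative to today's date):
age <- floor(as.numeric(Sys.Date() - bdate) / 365.25)
```

Within a tibble, the same steps would sit inside mutate() calls (or use separate() on a character version of `bdate`).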
Create exp data
Create an (experimental) dataset exp that contains typical features (e.g., multiple measurements with nested factors) of psychological studies.
n <- 10 # [n]umber of participants
set.seed(88) # for replicability
IVs <- data.frame("name" = c("Ann", "Bea", "Cat", "Deb", "Eva", "Fred", "Gary", "Hans", "Ian", "John"),
"gender" = c(rep("f", 5), rep("m", 5)),
"age" = sample(18:65, n, replace = TRUE)
)
IVs
#> name gender age
#> 1 Ann f 37
#> 2 Bea f 22
#> 3 Cat f 53
#> 4 Deb f 41
#> 5 Eva f 65
#> 6 Fred m 65
#> 7 Gary m 19
#> 8 Hans m 54
#> 9 Ian m 50
#> 10 John m 64
## (a) within-subjects conditions (with multiple tasks per person):
DVs <- data.frame("task_1" = rep(c("red", "blue"), 5),
# "pos_1" = rep(1, n),
"time_1" = sample(10:99, n),
"task_2" = rep(c("blue", "red"), 5),
# "pos_2" = rep(2, n),
"time_2" = sample(10:99, n)
)
DVs
#> task_1 time_1 task_2 time_2
#> 1 red 59 blue 29
#> 2 blue 76 red 75
#> 3 red 58 blue 83
#> 4 blue 25 red 89
#> 5 red 11 blue 88
#> 6 blue 38 red 69
#> 7 red 18 blue 44
#> 8 blue 99 red 98
#> 9 red 71 blue 56
#> 10 blue 91 red 41
## (b) between-subjects conditions (with separate variables):
# DVs2 <- data.frame("cond" = c(rep("A", n/2), rep("B", n/2)),
# "A.num1" = c(sample(1:7, n/2), rep(NA, n/2)),
# "B.num1" = c(rep(NA, n/2), sample(3:9, n/2)),
# "A.chr1" = c(sample(c("ABBA", "Beatles"), n/2, replace = TRUE), rep(NA, n/2)),
# "B.chr1" = c(rep(NA, n/2), sample(c("ABBA", "Pink Floyd"), n/2, replace = TRUE))
# )
DVs2 <- data.frame("cond" = rep(c("A", "B"), n/2))
DVs2$A.num1 <- NA
DVs2$A.num1[DVs2$cond == "A"] <- c(sample(1:6, n/2, replace = TRUE))
DVs2$B.num1 <- NA
DVs2$B.num1[DVs2$cond == "B"] <- c(sample(4:9, n/2, replace = TRUE))
DVs2$A.chr1 <- NA
DVs2$A.chr1[DVs2$cond == "A"] <- c(sample(c("Abba", "Beatles"), n/2, replace = TRUE))
DVs2$B.chr1 <- NA
DVs2$B.chr1[DVs2$cond == "B"] <- c(sample(c("Beatles", "Zappa"), n/2, replace = TRUE))
DVs2
#> cond A.num1 B.num1 A.chr1 B.chr1
#> 1 A 2 NA Beatles <NA>
#> 2 B NA 9 <NA> Beatles
#> 3 A 5 NA Abba <NA>
#> 4 B NA 6 <NA> Beatles
#> 5 A 4 NA Abba <NA>
#> 6 B NA 8 <NA> Beatles
#> 7 A 3 NA Beatles <NA>
#> 8 B NA 8 <NA> Zappa
#> 9 A 5 NA Abba <NA>
#> 10 B NA 9 <NA> Beatles
# Note that DVs encodes order (or chronological trial position)
# implicitly (as 1st vs. 2nd entry) for every case.
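To make that implicit trial order explicit, the wide DVs could be reshaped into long format; a sketch using tidyr's gather/separate/spread (the helper column id is introduced here only for illustration):

```r
## Sketch: reshape DVs from wide to long, making trial order explicit:
DVs_long <- DVs %>%
  mutate(id = 1:n()) %>%                # add a case identifier
  gather(key, value, -id) %>%           # task_1/time_1/... into key-value pairs
  separate(key, c("var", "trial")) %>%  # split "task_1" into "task" + "1"
  spread(var, value) %>%                # 1 column per variable (task, time)
  mutate(time = as.numeric(time))       # gather coerced time to character
## DVs_long now has 1 row per case and trial (20 rows, 4 columns).
```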
## Combine IVs and DVs:
exp <- cbind(IVs, DVs)
exp2 <- cbind(IVs, DVs2)
# exp
# exp2
dim(exp)
#> [1] 10 7
dim(exp2)
#> [1] 10 8

Use exp data
Basic manipulations of exp data (in base R):
exp.t <- exp # temporary copy
# 1. Add 2 new variables/columns to a df by assigning to them:
exp.t$id <- 1:nrow(exp.t)
exp.t$bnt <- sample(1:4, n, replace = TRUE)
# exp.t
# 2. Swap some columns (e.g., putting id to front):
id.col <- which(names(exp.t) == "id") # determine id column
bnt.col <- which(names(exp.t) == "bnt") # determine bnt column
exp.t <- exp.t[ , c(id.col, 1:(id.col - 1), bnt.col)]
exp.t
#> id name gender age task_1 time_1 task_2 time_2 bnt
#> 1 1 Ann f 37 red 59 blue 29 1
#> 2 2 Bea f 22 blue 76 red 75 1
#> 3 3 Cat f 53 red 58 blue 83 2
#> 4 4 Deb f 41 blue 25 red 89 4
#> 5 5 Eva f 65 red 11 blue 88 1
#> 6 6 Fred m 65 blue 38 red 69 1
#> 7 7 Gary m 19 red 18 blue 44 2
#> 8 8 Hans m 54 blue 99 red 98 3
#> [ reached getOption("max.print") -- omitted 2 rows ]
# 3. Sorting cases:
exp.t[order(exp.t$task_1, exp.t$bnt),] # sort cases by task_1 and bnt values
#> id name gender age task_1 time_1 task_2 time_2 bnt
#> 2 2 Bea f 22 blue 76 red 75 1
#> 6 6 Fred m 65 blue 38 red 69 1
#> 8 8 Hans m 54 blue 99 red 98 3
#> 10 10 John m 64 blue 91 red 41 3
#> 4 4 Deb f 41 blue 25 red 89 4
#> 1 1 Ann f 37 red 59 blue 29 1
#> 5 5 Eva f 65 red 11 blue 88 1
#> 9 9 Ian m 50 red 71 blue 56 1
#> [ reached getOption("max.print") -- omitted 2 rows ]
# 4. Reversing cases and variables:
# Reversing the (arbitrary) order of all cases and all variables in df:
exp.t[rev(1:nrow(exp.t)), rev(1:ncol(exp.t))]
#> bnt time_2 task_2 time_1 task_1 age gender name id
#> 10 3 41 red 91 blue 64 m John 10
#> 9 1 56 blue 71 red 50 m Ian 9
#> 8 3 98 red 99 blue 54 m Hans 8
#> 7 2 44 blue 18 red 19 m Gary 7
#> 6 1 69 red 38 blue 65 m Fred 6
#> 5 1 88 blue 11 red 65 f Eva 5
#> 4 4 89 red 25 blue 41 f Deb 4
#> 3 2 83 blue 58 red 53 f Cat 3
#> [ reached getOption("max.print") -- omitted 2 rows ]

Same steps in the tidyverse:
library(tidyverse)
exp.t2 <- exp # temporary copy
# 1. Add 2 new variables/columns to a df by assigning to them:
exp.t2 <- exp.t2 %>%
mutate(id = 1:nrow(exp.t2),
bnt = sample(1:4, n, replace = TRUE))
# exp.t2
# 2. Swap some columns (e.g., putting id to front):
exp.t2 <- exp.t2 %>%
select(id, everything())
# exp.t2
# 3. Sorting cases:
exp.t2 <- exp.t2 %>%
arrange(task_1, bnt)
# exp.t2
# 4. Reversing cases and variables:
# Reversing the (arbitrary) order of all cases and all variables in df:
exp.t2 %>%
mutate(row = 1:nrow(exp.t2)) %>% # add helper column
arrange(desc(row)) %>%
select(-row) %>% # remove helper column again
select(rev(names(exp.t2)))
#> bnt time_2 task_2 time_1 task_1 age gender name id
#> 1 3 56 blue 71 red 50 m Ian 9
#> 2 3 44 blue 18 red 19 m Gary 7
#> 3 3 83 blue 58 red 53 f Cat 3
#> 4 2 88 blue 11 red 65 f Eva 5
#> 5 1 29 blue 59 red 37 f Ann 1
#> 6 3 41 red 91 blue 64 m John 10
#> 7 3 75 red 76 blue 22 f Bea 2
#> 8 2 69 red 38 blue 65 m Fred 6
#> [ reached getOption("max.print") -- omitted 2 rows ]

Counting tasks and points
This file contains 13 tasks for a total of 69 points (plus some possible bonus points).
[Last update on 2018-07-12 12:07:09 by hn.]